N-gram distribution based language model adaptation
Authors
Abstract
This paper presents two techniques for language model (LM) adaptation. The first aims to build a more general LM. We propose a distribution-based pruning of n-gram LMs, in which we prune n-grams that are likely to be infrequent in a new document. Experimental results show that the distribution-based pruning method reduced word perplexity by up to 9% compared with conventional cutoff methods. Moreover, the pruning method results in a more general n-gram backoff model, despite domain, style, or temporal bias in the training data. The second technique aims to build a more task-specific LM. We propose an n-gram distribution adaptation method for LM training. Given a large set of out-of-task training data, called the training set, and a small set of task-specific training data, called the seed set, we adapt the LM towards the task by adjusting the n-gram distribution in the training set to that in the seed set. Experimental results show non-trivial improvements over conventional methods.
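The second technique above adjusts the n-gram distribution of a large out-of-task training set toward that of a small in-task seed set. A minimal sketch of this idea, assuming a simple linear interpolation of relative frequencies with a hypothetical mixing weight `alpha` (the paper's exact adjustment scheme may differ):

```python
from collections import Counter

def ngrams(tokens, n):
    """Extract all n-grams from a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def adapt_counts(train_sents, seed_sents, n=2, alpha=0.5):
    """Reweight training-set n-gram counts toward the seed-set distribution.

    alpha is an assumed interpolation weight: 0 keeps the training
    distribution, 1 moves fully to the seed distribution.
    """
    train = Counter(g for s in train_sents for g in ngrams(s, n))
    seed = Counter(g for s in seed_sents for g in ngrams(s, n))
    t_total = sum(train.values())
    s_total = sum(seed.values())
    adapted = {}
    for g, c in train.items():
        p_train = c / t_total
        p_seed = seed.get(g, 0) / s_total if s_total else 0.0
        # Interpolate relative frequencies, then rescale back to counts
        # so the result can feed a standard n-gram count-based trainer.
        p = (1 - alpha) * p_train + alpha * p_seed
        adapted[g] = p * t_total
    return adapted
```

Under this sketch, n-grams that are relatively more frequent in the seed set receive boosted counts, so a backoff LM estimated from the adapted counts is biased toward the task.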
Similar papers
Bilingual-LSA Based LM Adaptation for Spoken Language Translation
We propose a novel approach to crosslingual language model (LM) adaptation based on bilingual Latent Semantic Analysis (bLSA). A bLSA model is introduced which enables latent topic distributions to be efficiently transferred across languages by enforcing a one-to-one topic correspondence during training. Using the proposed bLSA framework crosslingual LM adaptation can be performed by, first, in...
Class-based language model adaptation using mixtures of word-class weights
This paper describes the use of a weighted mixture of class-based n-gram language models to perform topic adaptation. By using a fixed class n-gram history and variable word-given-class probabilities we obtain large improvements in the performance of the class-based language model, giving it similar accuracy to a word n-gram model, and an associated small but statistically significant improvemen...
Paragraph vector based topic model for language model adaptation
Topic models are an important approach for language model (LM) adaptation and have attracted research interest for a long time. Latent Dirichlet Allocation (LDA), which assumes a generative Dirichlet distribution with bag-of-word features for hidden topics, has been widely used as the state-of-the-art topic model. Inspired by recent development of a new paradigm of distributed paragraph representati...
Lemmatized Latent Semantic Model for Language Model Adaptation of Highly Inflected Languages
We present a method to adapt statistical N-gram models for large vocabulary continuous speech recognition of highly inflected languages. The method combines morphological analysis, latent semantic analysis (LSA) and fast marginal adaptation for building topic-adapted trigram models, based on a background language model and very short adaptation texts. We compare words, lemmas and morphemes as b...
Rapid Unsupervised Topic Adaptation – a Latent Semantic Approach
In open-domain language exploitation applications, a wide variety of topics with swift topic shifts has to be captured. Consequently, it is crucial to rapidly adapt all language components of a spoken language system. This thesis addresses unsupervised topic adaptation in both monolingual and crosslingual settings. For automatic speech recognition we rapidly adapt a language model on a source l...